ABSTRACT
While loop reordering and fusion affect the constant-factor performance of dense tensor programs, their effects on sparse tensor programs are asymptotic, often leading to orders-of-magnitude performance differences in practice. Sparse tensors also introduce a choice of compressed storage formats that can have asymptotic effects. Research into sparse tensor compilers has led to simplified languages that express these tradeoffs, but the user is expected to provide a schedule that makes the decisions. This is challenging because the schedule writer must anticipate the interaction between sparse formats, loop structure, potential sparsity patterns, and the compiler itself. Automating this decision-making process stands to finally make sparse tensor compilers accessible to end users.
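To make the asymptotic effect of format choice concrete, the following sketch (not from the paper; the matrix and CSR arrays are illustrative) contrasts a dense loop nest for matrix-vector multiplication, which performs O(n·m) work regardless of sparsity, with a loop over a compressed sparse row (CSR) representation, which performs only O(nnz) work:

```python
# Sketch: why storage format choice has asymptotic, not constant-factor, effects.

def spmv_dense(A, x):
    """Dense loop nest: O(n*m) work, visits every entry including zeros."""
    n, m = len(A), len(A[0])
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y

def spmv_csr(pos, crd, val, x, n):
    """CSR traversal: O(nnz) work, visits only stored nonzeros.
    pos[i]:pos[i+1] delimits row i's entries in crd (columns) and val (values)."""
    y = [0.0] * n
    for i in range(n):
        for p in range(pos[i], pos[i + 1]):
            y[i] += val[p] * x[crd[p]]
    return y

# A 3x4 matrix with 3 nonzeros, in both dense and CSR form.
A = [[0, 2, 0, 0],
     [0, 0, 0, 0],
     [5, 0, 0, 1]]
pos, crd, val = [0, 1, 1, 3], [1, 0, 3], [2.0, 5.0, 1.0]
x = [1.0, 1.0, 1.0, 1.0]
assert spmv_dense(A, x) == spmv_csr(pos, crd, val, x, 3)
```

The two loops compute the same result, but as the matrix grows with fixed nonzero count, the dense version's work grows with the iteration space while the CSR version's work stays proportional to nnz.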
We present, to the best of our knowledge, the first automatic asymptotic scheduler for sparse tensor programs. We provide an approach to abstractly represent the asymptotic cost of schedules and to choose between them. We narrow down the search space to a manageably small Pareto frontier of asymptotically non-dominated kernels. We test our approach by compiling these kernels with the TACO sparse tensor compiler and comparing them against those generated with TACO's default schedules. Our results show that our approach reduces the scheduling space by orders of magnitude and that the generated kernels perform asymptotically better than those generated using the default schedules.
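The Pareto-frontier idea above can be sketched in a few lines. This is a hedged illustration, not the paper's actual cost model: costs are hypothetical tuples of asymptotic exponents, and a schedule is kept only if no other schedule is at least as cheap in every component and strictly cheaper in one:

```python
# Sketch: filtering candidate schedules down to the Pareto frontier of
# asymptotically non-dominated cost vectors (hypothetical cost tuples).

def dominates(a, b):
    """True if cost a is <= b in every component and differs in at least one."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def pareto_frontier(costs):
    """Keep only the costs that no other cost dominates."""
    return [c for c in costs
            if not any(dominates(d, c) for d in costs if d != c)]

# Hypothetical asymptotic costs of four candidate schedules.
costs = [(1, 2), (2, 1), (2, 2), (3, 3)]
frontier = pareto_frontier(costs)
# (2, 2) is dominated by (1, 2), and (3, 3) by every other schedule,
# so only (1, 2) and (2, 1) survive.
```

Only the surviving, mutually incomparable schedules would then need to be compiled and measured, which is how a frontier like this can shrink the scheduling space by orders of magnitude.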
Index Terms
- Autoscheduling for sparse tensor algebra with an asymptotic cost model